ABSTRACT


Course selection can be a daunting task, especially for first- 
year students. Sub-optimal selection can lead to bad per- 
formance of students and increase the dropout rate. Given 
the availability of historic data about student performances, 
it is possible to aid students in the selection of appropriate 
courses. Here, we propose a method to compose a personal- 
ized curriculum for a given student. We develop a modular 
approach that combines a context-aware grade prediction 
with statistical information on the useful temporal ordering 
of courses. This allows for meaningful course recommenda- 
tions, both for fresh and senior students. We demonstrate 
the approach using the data of the computer science Bach- 
elor students at Saarland University. 


1. INTRODUCTION 


Students at higher education institutions usually have to 
choose from a large set of possible courses in order to achieve 
an academic degree. Even for senior students, it is not ob- 
vious which courses to follow and in what sequence as the 
number of possible choices is large. Students often struggle
to make steady progress in a program of study, especially
in the first years of study, and to graduate in a timely
manner.


Student success is also an important objective for decision
makers at universities, which continuously monitor dropout
rates and average times to degree. Completion rates
at European universities range from 39% to 85% and are
highly program dependent, while the average time-to-degree
is around 3.5 years for a Bachelor degree [17].


When pursuing a degree students typically have to complete 
a set of mandatory courses, as well as courses that can be 
chosen more freely. In the first years, an adequate order of 
mandatory courses is of interest while in later years the focus 
is on the question which courses to take in general and which 
not. Instead of relying on individual recommendations from 
other students, our goal is to take advantage of the combined 
experience of former students and address both an adequate
temporal ordering and an intelligent selection of courses.


We propose an approach that combines statistical methods 
based on course orderings and grade prediction based on a 
collaborative filtering approach. This results in a model con- 
sisting of two main components, a course dependency graph 
and grade prediction. Therefore our model combines two 
major criteria: The expected performance, i.e. the expected 
grade, and preparedness, i.e. how prior course choices may 
benefit the student, for a given course. We believe that
combining the two criteria makes our recommendations
considerably more usable than those of previous work
focusing on only one of the two.


To train our model we use long-term educational data of 
computer science Bachelor students from Saarland Univer- 
sity’s computer science department. The data consists of 
course performance information from several thousand stu- 
dents of various countries during the last ten years. Experi- 
ments with a first subset of students already showed promis- 
ing results giving recommendations for first-year as well as 
for senior students. 


2. RELATED WORK 


Many course recommendation approaches are based on
performance prediction. A wide range of standard machine
learning methods have been applied to this problem [14, 25],
as well as recommender system techniques [10]. Ray and
Sharma apply collaborative filtering based on item-item
similarity. Ren et al. [9] supplement a matrix factorization
approach with weights for recently taken courses. Besides a
gain in predictive quality, the resulting model carries
valuable information on beneficial orderings of courses.
Polyzou and Karypis propose a matrix factorization based
on course-specific features. Slim et al. use Markov
networks of courses to predict individual grades and estimate
future performance within a study program.


In contrast to the aforementioned approaches, our technique 
separates the concerns of performance and preparedness. 
This has the benefit of allowing for a custom weighting of 
the two components, as well as the increased explanatory 
value of the model itself. 


Much effort on curriculum planning has been focused on 
Massive Open Online Courses (MOOC). For instance, Hansen 
et al. analyse characteristic question sequences in online 
courses by applying Markov chains to student clusters. Chen
et al. [3] propose a sequencing for items in the context of
web-based courses.


In the context of university education, much effort has been
directed towards providing analytical tools to educators and
institutions. For example, Zimmermann et al. predict
graduate performance based on the students’ undergraduate
performances. Saarela and Kärkkäinen analyse
undergraduate student data to identify relevant factors for a
successful computer science education.


3. PROBLEM SETTING 


We consider the problem of designing a student’s curriculum 
that optimizes performance (measured in terms of course 
grades) and the time to degree. Hence, for each semester a 
subset of the courses offered is chosen such that the student’s 
complete trace from the first semester until the final degree 
is (approximately) optimal, i.e., the performance and time to
degree do not improve if the order in which the courses are
taken is changed or if different courses are taken. We assume
that a large number of traces of former students are given,
including the particular grades achieved in each course. Note
that this also includes data of students retaking courses after
failing. However, the data may not provide information about
students that enroll in a course but withdraw before the
final exam. In addition, we assume that, for students that
already participated in certain courses, the corresponding
partial trace is available as well as metadata
about the student. Moreover, we want to take into account
all selection rules of the corresponding study program. 


The data-set consists of performance and meta-information 
of the students at the computer science department of Saar- 
land University since 2006. It includes grades, basic infor- 
mation regarding students (age, nationality, sex, course of 
studies) as well as basic information regarding the lecture 
(course type, lecturer). Here, we consider a subset of 72 re- 
curring courses which have a total of 16,090 entries of 1,700 
students. A challenge regarding this particular data set is 
the fact that students may register fairly late in the semester 
for a particular course. Therefore the data does not capture
early student dropout.


4. COMPONENTS OF OUR APPROACH 


In the context of standard recommender systems, the pre- 
dicted rating is the basis for a recommendation. However, in 
the context of course recommendation, further aspects, such 
as the knowledge gain and constraints of the study program 
have to be taken into account. Here, we present an approach 
that is flexible enough to also incorporate such criteria in a 
modular way. Moreover, in our approach selection criteria 
can further be prioritized by the student. A student may, 
for example, prioritize taking a course that increases the
preparedness for certain other courses. In this case, the course
may be recommended although the student’s performance
alone would not have led to a suggestion of that course.


We construct a personalized recommendation graph of courses 
for each student based on the two main components: the 
course dependency graph and the performance prediction. 
The course dependency graph aims to capture the positive 
effect that course A has on the performance in course B. The
performance prediction is done using a collaborative filtering
approach that incorporates contextual features of both the
student and the course.


4.1 Course Dependency Graph 

The Course Dependency Graph is a graph whose node set 
equals the set of all (regularly or irregularly offered) courses. 
A directed edge between course A and course B means that 
when passing A before B then the chance of getting a better 
grade in B is higher compared to the grade in B obtained 
for the order B before A. 


We use the Mann-Whitney U-test to construct such a
graph of courses. The hypothesis of the test is that one
random variable is smaller than another. If we let the random
variable $X_{<c}$ denote the grade in course B for a student
that had a grade $< c$ in course A, an edge represents the
hypothesis

$$\Pr(X_{<c} < k) > \Pr(X_{>c} < k),$$

where $X_{>c}$ includes the case of not taking course A. The
hypothesis states that the probability of drawing a grade
from $X_{<c}$ that is better than $k$ is higher than the
probability of doing the same for $X_{>c}$. We fix a small
significance level $\alpha = 0.0001$ to find the most important
course relations. Since the test is quite sensitive, it tends
to identify too many course pairs for higher significance
levels. Moreover, a minimum number of 20 samples is required
for each case to perform the test. The graph only contains an
edge between two courses if the test confirms the above
hypothesis.
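
The edge test above can be sketched as follows. This is a minimal illustration and not the implementation used for the paper: it assumes the one-sided test is computed with the normal approximation (mid-ranks for ties, tie-corrected variance, no continuity correction), and the names `mann_whitney_less` and `has_edge` are hypothetical. Both grade samples are assumed to be non-empty lists of grades in course B, split by the student's result in course A.

```python
import math
from collections import Counter


def mann_whitney_less(x, y):
    """One-sided Mann-Whitney U test via the normal approximation.

    Returns the p-value for the alternative that values in x tend to be
    smaller (i.e. better grades) than values in y. Ties get mid-ranks and
    the variance is tie-corrected; no continuity correction is applied.
    """
    n1, n2 = len(x), len(y)
    pooled = sorted(list(x) + list(y))
    rank_of = {}
    pos = 0
    while pos < len(pooled):
        end = pos
        while end < len(pooled) and pooled[end] == pooled[pos]:
            end += 1
        # Positions pos..end-1 hold equal values; they share ranks pos+1..end.
        rank_of[pooled[pos]] = (pos + 1 + end) / 2.0
        pos = end
    u1 = sum(rank_of[v] for v in x) - n1 * (n1 + 1) / 2.0
    n = n1 + n2
    ties = sum(c ** 3 - c for c in Counter(pooled).values())
    variance = n1 * n2 / 12.0 * ((n + 1) - ties / (n * (n - 1)))
    if variance <= 0:
        return 1.0  # all grades identical: no evidence for an ordering
    z = (u1 - n1 * n2 / 2.0) / math.sqrt(variance)
    # A small U statistic means x ranks low, so use the lower tail.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def has_edge(grades_b_low, grades_b_high, alpha=0.0001, min_samples=20):
    """Decide whether to draw an edge A -> B for one grade threshold.

    grades_b_low:  grades in B of students below the threshold in A.
    grades_b_high: grades in B of the remaining students.
    """
    if len(grades_b_low) < min_samples or len(grades_b_high) < min_samples:
        return False
    return mann_whitney_less(grades_b_low, grades_b_high) < alpha
```

With only eleven possible grade values, ties are the norm, which is why the sketch uses mid-ranks and the tie-corrected variance rather than the plain formula.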


In Germany grades are numbers in the set
$$P = \{1, 1.3, 1.7, 2, 2.3, 2.7, 3, 3.3, 3.7, 4, 5\},$$


where lower numbers are better and 5 is the failing grade. 
In general, we assume these performances to be normalized 
to mean zero and unit variance w.r.t. courses. 


To construct the course dependency graph, we first construct 
one graph for each grade threshold $c \in P$. Next we average
over the edges of all graphs, resulting in edge weights be- 
tween 0 and 1. In this way the final graph, in which course 
dependency is not binary but a weighting, is more informa- 
tive. A large value implies that this course ordering is bene- 
ficial to students of all performance levels while a low value 
indicates that this ordering is only helpful for a smaller set 
of students. Note that the absence of edges indicates that 
there is not enough information about the relation between 
the two courses. 
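
The averaging step can be illustrated with a short sketch. It assumes each per-threshold graph is given as a set of directed edges and that every threshold counts equally in the average, including thresholds where an edge is absent; the function name `average_edge_weights` is hypothetical.

```python
def average_edge_weights(threshold_graphs):
    """Average binary edge indicators over one graph per grade threshold.

    threshold_graphs: list of sets of (course_a, course_b) edges.
    Returns a dict mapping every edge that occurs at least once to the
    fraction of threshold graphs that contain it, a value in (0, 1].
    """
    counts = {}
    for edges in threshold_graphs:
        for edge in edges:
            counts[edge] = counts.get(edge, 0) + 1
    return {edge: c / len(threshold_graphs) for edge, c in counts.items()}
```

An edge that is significant for three out of four thresholds gets weight 0.75, while an edge that never passes the test does not appear in the result at all.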


An excerpt of a course dependency graph is shown in Figure 1.
We find that ‘Programming I’, ‘Maths I’ and ‘Maths
II’ are good starting points in this graph for a first-year
student as they do not have incoming edges. Note that the
missing edge between ‘Maths I’ and ‘Maths II’ is meaningful as
‘Maths I’ focuses on Linear Algebra while ‘Maths II’ is
concerned with Analysis. As opposed to this, for ‘Programming
I’ and ‘II’ the graph suggests to first take ‘Programming I’ 
as a preparation, which is a meaningful recommendation as 
the contents of ‘Programming II’ are based on those of ‘Pro- 
gramming I’. Moreover, the graph shows a number of less 
obvious relations between courses (e.g. ‘Programming II’ 
and ‘Theoretical Computer Science’). 


Proceedings of the 11th International Conference on Educational Data Mining 247 


[Figure 1 graph. Nodes: ‘Programming I’, ‘Maths I’, ‘System architecture’, ‘Programming II’, ‘Theoretical CS’, ‘Software lab’, ‘Maths II’, ‘Information systems’. Legible edge weights: 0.37, 0.82, 0.2.]


Figure 1: Excerpt of a course dependency graph, based on 
Mann-Whitney U-test with significance level of 0.0001, rep- 
resenting the dependencies between most of the basic courses 
in CS curriculum at Saarland University. 


4.2 Grade prediction 

We use a collaborative filtering approach to predict
student performance. One advantage of this approach is that
no imputation of missing entries is necessary, since the
optimization only runs over existing entries.


We associate with each student $i$ and course $j$ an
$n$-dimensional feature vector, $s_i$ and $c_j$, respectively.
The predicted performance is the inner product of both
vectors, i.e.

$$f(i,j) = \langle s_i, c_j \rangle = \sum_{k=1}^{n} s_{i,k}\, c_{j,k},$$

which we call the predictor function. Let $g_{i,j}$ be the
performance of a student $i$ in course $j$ and let $G_t$ denote
the set of all known performances of students up to semester
$t$. Then the standard loss is the regularized MSE, i.e.

$$L(S,C,t) = \sum_{g_{i,j} \in G_{t-1}} \left(f(i,j) - g_{i,j}\right)^2 + \lambda R(S,C)$$

with regularization term

$$R(S,C) = \sum_{i \in S} \|s_i\| + \sum_{j \in C} \|c_j\|,$$

where S is the set of all students and C the set of all courses.
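
To make the loss concrete, the following sketch evaluates it for given feature vectors. It is illustrative only: it assumes the Euclidean norm in the regularization term and stores the observed performances in a dictionary keyed by (student, course). The names `predict`, `norm` and `regularized_loss` are hypothetical.

```python
import math


def predict(s_i, c_j):
    """Inner product of a student vector and a course vector."""
    return sum(a * b for a, b in zip(s_i, c_j))


def norm(v):
    """Euclidean norm of a feature vector."""
    return math.sqrt(sum(a * a for a in v))


def regularized_loss(students, courses, observed, lam):
    """Sum of squared errors over observed grades plus a norm penalty.

    students: dict student id -> feature vector
    courses:  dict course id -> feature vector
    observed: dict (student id, course id) -> normalized grade
    lam:      regularization weight
    """
    squared_error = sum(
        (predict(students[i], courses[j]) - g) ** 2
        for (i, j), g in observed.items()
    )
    penalty = sum(norm(v) for v in students.values())
    penalty += sum(norm(v) for v in courses.values())
    return squared_error + lam * penalty
```

Because the sum runs only over the entries in `observed`, student-course pairs without a grade never enter the loss, which is the reason no imputation is needed.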


4.2.1 Contextual Information 

The above loss metric only depends on information about 
the students’ performances, i.e. their grades. However, the 
context of a performance can contain vital information. Usu- 
ally, in the context of student records a wealth of data is 
readily available. This includes meta-data of a student such 
as age, gender, and nationality and data regarding the pro- 
gression of the student throughout study programs. More- 
over, information regarding the course, such as the lecturer, 
is typically known. 


A standard and straightforward approach to include such
information is to pre-filter data [10]. This entails
partitioning data along contextual criteria and then training
a model for each subset. Here, the only pre-filtering performed
is to take only computer science Bachelor students into account.
Other partitionings, e.g. partitioning along the semester,
have not improved predictive quality.


Further contextual information is included explicitly in the 
model as follows. The predictor f is augmented by linear 
terms for contextual features. Categorical features, such
as teachers, are one-hot encoded. Continuous features are
standardized to zero mean and unit variance. In principle we
can introduce these additional linear parameters for both
courses and students, but it turns out that the best results
are achieved if we associate features with courses. Given
the large number of contextual features it proved advanta- 
geous to set up a feature selection pipeline in which certain 
features are identified for each course. Specifically, features 
were identified by using a 5-fold cross-validated recursive 
feature elimination. Therein features are iteratively removed 
according to their coefficient in a linear model. The cross- 
validation is used to determine the number of features kept. 
Thus, the predictor becomes 


$$f(i,j,t) = \langle s_i, c_j \rangle + \langle \mathrm{ctx}(i,j,t), c_j^{\mathrm{ctx}} \rangle,$$


where $\mathrm{ctx}$ is the performance context according to the
above feature selection pipeline. Consequently, the parameter
vector for course $j$ becomes

$$\tilde{c}_j = \left(c_{j,1}, \dots, c_{j,n}, c^{\mathrm{ctx}}_{j,1}, \dots, c^{\mathrm{ctx}}_{j,m_j}\right),$$

and $m_j$ is the number of features selected for the context of
a performance in course $j$.
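
A small sketch of the augmented predictor: the context vector is built by one-hot encoding a categorical feature and enters through an additional linear term with course-specific coefficients. The helper names `one_hot` and `predict_with_context` and the example categories are hypothetical.

```python
def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector over known categories."""
    return [1.0 if value == c else 0.0 for c in categories]


def dot(u, v):
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def predict_with_context(s_i, c_j, ctx, c_j_ctx):
    """Latent-factor prediction plus a linear term for the context.

    s_i, c_j: latent student and course vectors of length n
    ctx:      context features selected for course j (length m_j)
    c_j_ctx:  course-specific coefficients for these features
    """
    return dot(s_i, c_j) + dot(ctx, c_j_ctx)
```

For instance, with lecturers `["A", "B", "C"]` a performance under lecturer `"B"` is encoded as `[0.0, 1.0, 0.0]`, so only the coefficient for `"B"` shifts the predicted grade.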


Another key property to be considered when working with
past performances is the temporal distance to the current
time. A performance achieved one semester ago should be
considered more important than one five semesters ago [9].
Therefore it is natural to add a temporal decay to the loss
function. Considering the semester $t'$ of a specific
performance $g_{i,j,t'}$, we can multiply each squared error by
an exponential decay function. Thus, the now time-dependent
loss is

$$L_\alpha(S,C,t) = \sum_{g_{i,j,t'} \in G_{t-1}} e^{-\alpha (t - t')} \left(f(i,j,t) - g_{i,j,t'}\right)^2 + \lambda R(S,C), \qquad (1)$$

where $\alpha > 0$ is the temporal decay parameter.
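
The effect of the decay can be sketched as below. The exact form of the decay factor is an assumption here: the sketch uses exp(-alpha * (t - t')), measured relative to the semester t for which predictions are made, and omits the regularization term. The function names are hypothetical.

```python
import math


def decay_weight(t, t_prime, alpha):
    """Weight of a performance from semester t_prime when predicting for t."""
    return math.exp(-alpha * (t - t_prime))


def time_weighted_squared_error(records, predict, t, alpha):
    """Decay-weighted sum of squared errors over performances before t.

    records: iterable of (student, course, semester, grade)
    predict: function (student, course) -> predicted grade
    """
    return sum(
        decay_weight(t, t_prime, alpha) * (predict(i, j) - g) ** 2
        for i, j, t_prime, g in records
        if t_prime < t
    )
```

Under this assumption and with a decay parameter of 0.1, a grade from one semester ago is weighted by about 0.90 and a grade from five semesters ago by about 0.61.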


4.2.2 Minimization 

The non-linear minimization problem in Eq. (1) is of high
dimensionality because of the parameter vectors $s_i$ and $c_j$
for $i \in S$, $j \in C$. It can be solved most effectively using
stochastic gradient descent techniques with adaptive learning
rates, because with this approach course vectors stabilize
more quickly. Specifically, we used the Adagrad algorithm [4],
which avoids strong alteration of frequently considered
parameters, which is the case for many course parameters,
while seldom encountered parameters may be altered more,
which is fitting for student parameters. We
fixed a batch size of 1000 and performed 500,000 iterations
of the algorithm. Each minimization is performed for 5
different random initial values, and the result with the
smallest training loss is selected. This was performed for all
semesters in a grid search over different dimensionality
parameters and regularization parameters, i.e. for parameter
tuples $(\lambda, n)$. Before minimization the data was normalized
along the lectures to zero mean and unit variance.
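
The behaviour described for Adagrad follows directly from its update rule, sketched here in its textbook form for a plain list of parameters. The learning rate and epsilon values are illustrative and are not the settings used in the experiments.

```python
import math


def adagrad_step(params, grads, accum, lr=0.1, eps=1e-8):
    """Apply one in-place Adagrad update to a list of parameters.

    accum holds the running sum of squared gradients per parameter, so
    parameters that are updated often take ever smaller steps.
    """
    for k, g in enumerate(grads):
        accum[k] += g * g
        params[k] -= lr * g / (math.sqrt(accum[k]) + eps)
```

A course vector receives a gradient from almost every batch, so its accumulated sum grows quickly and its steps shrink. A student vector appears in few batches and keeps a comparatively large effective step size.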


4.2.3 Evaluation 

The most natural approach to evaluate the model is to split
the data by semesters. Given a fixed semester $t$, the data up
to (including) semester $t-1$, i.e. $G_{t-1}$, is used as a
training set. The data of semester $t$, i.e. $G_t \setminus G_{t-1}$,
is used as a test set.

The measures of quality we use are the mean absolute error
(MAE) and the root mean square error (RMSE). As a
baseline we provide the RMSE and MAE for the mean
predictor with respect to both the students and the courses in
Table 1.
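
The evaluation protocol and the course-wise mean baseline can be sketched as follows. Records are assumed to be tuples of (student, course, semester, grade), and falling back to the overall mean for courses unseen in training is an assumption of this sketch. All function names are hypothetical.

```python
import math


def split_by_semester(records, t):
    """Train on all semesters before t, test on semester t itself."""
    train = [r for r in records if r[2] < t]
    test = [r for r in records if r[2] == t]
    return train, test


def course_mean_predictor(train):
    """Return a function mapping a course to its mean training grade."""
    grades = {}
    for _student, course, _semester, grade in train:
        grades.setdefault(course, []).append(grade)
    overall = sum(r[3] for r in train) / len(train)
    means = {c: sum(g) / len(g) for c, g in grades.items()}
    # Courses unseen in training fall back to the overall mean (an assumption).
    return lambda course: means.get(course, overall)


def mae(predicted, actual):
    """Mean absolute error."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root mean square error."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))
```

The student-wise baseline is obtained analogously by grouping the training grades by student instead of by course.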

In the evaluation of the context-free model, we see that
low-dimensional models (i.e. models with only few features)
perform best. The absolute values of these errors are
further improved by pre-filtering the data considered. If, for
example, only Bachelor computer science students are
considered, the test error decreases. The decay factor leads to
an improvement. For example, for $n = 1$ and $\lambda = 0.1$ the
MAE decreases from 0.856 to 0.852. In Figure 2 the
prediction results for the importance decay $\alpha = 0.1$ are shown.
Given this loss function, the one-dimensional, less
regularized model outperforms the others in terms of both the
MAE and the RMSE. The inclusion of contextual information
leads to a further reduction, such that for $n = 1$ and
$\lambda = 0.1$ the MAE is 0.8459, while the RMSE is 1.0904.


Table 1: The RMSE and MAE for the mean predictors along
the student and the course axis, respectively.


           MAE      RMSE
course     1.1130   1.3311
student    0.9268   1.1883


5. RECOMMENDATION SYNTHESIS 


The recommendation combines the course dependency graph, 
the grade prediction, and constraints based on the study 
regulation in order to compute a recommendation score. A 
larger score corresponds to a stronger recommendation. 


5.1 Combining the Components 

The recommendation score for a course $j$ w.r.t. a student
$i$ combines several criteria, namely the preparedness for $j$,
the general merit of $j$, and the predicted performance of $i$
in course $j$.


Let $R_i$ denote the set of courses that student $i$ has finished
within the last $t$ semesters. Now, for each course $j \in C \setminus R_i$,
we sum over the weights of the edges of the course
dependency graph that start in some course $j' \in R_i$ and end
in $j$. This value is an approximation for the preparedness
$p_{i,j} \in \mathbb{R}_{\geq 0}$ of the student w.r.t. course $j$.


For the general merit of a course, we use the out-degree
$\deg^+(j)$ of the course in the graph as an approximation of its
benefit towards other courses. Note that this criterion is
especially relevant for first-year students, as for them nodes
with higher out-degree provide a good starting point. Further
note that for such students $R_i = \emptyset$, and the grade
prediction can only give average values as no information about
their previous performance is available.


Figure 2: The MAE (a) and RMSE (b) for different
dimensionalities $n$ and regularization parameters $\lambda$. The models
were trained and tested on Bachelor CS students only. The
loss is weighted by time with $\alpha = 0.1$.


To incorporate information about the predicted performance,
we transform the predicted grades $\hat{g}_{i,j}$ such that good
grades map to large values and poor grades to small values,
i.e., we consider the value $(5 - \hat{g}_{i,j})/4 \in [0, 1]$.


We combine these factors in a linear model that gives
us a raw, unfiltered recommendation value

$$\tilde{r}_{i,j} = c_p\, p_{i,j} + c_g\, (5 - \hat{g}_{i,j})/4 + c_m \deg^+(j), \qquad (2)$$

where $c_p, c_g, c_m \in [0,1]$ provide a weighting for the three
factors, i.e., $c_p + c_g + c_m = 1$.


We finally filter the recommendations as follows. The choice
of courses is constrained by study regulations. Thus, for a
given student $i$, some courses may not contribute towards
completion of the program or she may not be able to enroll
in them (‘not allowed’). Thus, the final recommendation
value is a product of the raw value $\tilde{r}_{i,j}$ and a function
value $\mathrm{reg}(i,j)$, where

$$\mathrm{reg}(i,j) = \begin{cases} 1 & j \text{ is part of program} \\ 0 & j \text{ not allowed} \\ c_e(i) & \text{otherwise} \end{cases}$$

This introduces a further parameter $c_e(i) \in [0,1]$ associated
with courses that are not necessary to achieve the degree but
may lead to an improvement of the final grade or may be
interesting to the student. E.g. a student of bioinformatics
may choose $c_e(i) = 0.5$ to also get recommendations for
computer science courses that are not part of the bioinformatics
program. However, the default value is $c_e(i) = 0$.


Figure 3: Example of a recommendation graph, based on the
dependency graph given in Figure 1. ‘Programming I’ and ‘Maths
I’ have been passed already and the edge weights have been
updated accordingly. The recommendation values were
computed with $c_p = 0.76$, $c_g = 0.21$ and $c_m = 0.03$.


Hence, the overall recommendation value of course $j$ is

$$r_{i,j} = \left(c_p\, p_{i,j} + c_g\, (5 - \hat{g}_{i,j})/4 + c_m \deg^+(j)\right) \mathrm{reg}(i,j) \qquad (3)$$

with weight parameters $c_p$, $c_g$, $c_m$.
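
The recommendation value can be computed as in the following sketch. It assumes the dependency graph is stored as a dictionary from directed edges to weights, that the out-degree is the unweighted number of outgoing edges, and that the regulation factor is passed in as a number. The function names and the toy graph used to exercise them are hypothetical.

```python
def preparedness(edges, finished, course):
    """Sum of edge weights from finished courses into the given course."""
    return sum(
        weight
        for (source, target), weight in edges.items()
        if source in finished and target == course
    )


def out_degree(edges, course):
    """Number of outgoing edges of a course (unweighted, an assumption)."""
    return sum(1 for (source, _target) in edges if source == course)


def recommendation_value(edges, finished, course, predicted_grade,
                         reg, c_p, c_g, c_m):
    """Weighted combination of the three criteria, filtered by regulations.

    predicted_grade: predicted grade on the 1 (best) to 5 (fail) scale
    reg:             1 if the course is part of the program, 0 if it is
                     not allowed, otherwise the student's chosen value
    """
    raw = (c_p * preparedness(edges, finished, course)
           + c_g * (5.0 - predicted_grade) / 4.0
           + c_m * out_degree(edges, course))
    return raw * reg
```

For a student who finished course A, a graph with edges A to B (0.5), C to B (0.25) and B to D (1.0), a predicted grade of 2.0 in B and weights (0.76, 0.21, 0.03), the value for B is 0.76 * 0.5 + 0.21 * 0.75 + 0.03 * 1 = 0.5675 if B is part of the program.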


To illustrate the influence of the different factors, we con- 
sider the following example. Suppose a first-year student 
in the winter semester uses the system to compose his first 
curriculum. We do not have any performance knowledge 
about the student, so this is a cold start scenario.
Reconsider the dependency graph in Figure 1. Because of the high
out-degrees, we recommend ‘Programming I’, ‘Maths I’ and
‘Theoretical CS’. The student successfully attends the first
two of these courses in the following winter semester. Now
we are able to incorporate the achieved grades in our
prediction model. The recommendation values now computed
per course are visualized as a star graph in Figure 3.
Finally, a valid suggestion for the next semester based on
the recommendation values is a combination of ‘Program- 
ming II’, ‘Information systems’ and ‘Maths II’. In general, 
at the beginning of every semester, we can provide the stu- 
dent with a personalized curriculum by compiling a list of 
lectures based on their recommendation score. 


5.2 Evaluation 

We now assess how similar our recommendations are to the
actually selected courses of the students. Again, we separate
the student data by semesters, such that recommendations
are only based on data of previous semesters. To define the
metric, let $T$ be the set of semesters and $S_t$ the set of
students who took some course in semester $t \in T$. Further,
given some semester $t$, let $C^{sel}_{t,i}$ be the set of courses
in which student $i$ was enrolled and let $C^{rec}_{t,i}$ be the set
of recommended courses for student $i$. We adopt a top-$k$
recommendation policy in which we recommend only the $k$
courses with the highest recommendation value. Moreover, we
only take into account lectures which were available in the
given semester and study program.


To approximate the conformity of our recommendations we
consider the conformity score

$$1 - \frac{1}{|T|} \sum_{t \in T} \frac{1}{|S_t|} \sum_{i \in S_t} \frac{\min(k, |C^{sel}_{t,i}|) - |C^{rec}_{t,i} \cap C^{sel}_{t,i}|}{\min(k, |C^{sel}_{t,i}|)},$$

where the second term calculates the average ratio of the
number of courses that have been selected by the student
but were not recommended or that were recommended but
not selected. So we end up with a score indicating the
congruency of our recommendations with the student’s actual
course selections.
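
A direct transcription of the conformity score into code may help to read it. The sketch assumes selections are given per semester and student as sets, recommendations as lists ranked by recommendation value, and that every listed student selected at least one course. The function name `conformity_score` is hypothetical.

```python
def conformity_score(selected, recommended, k):
    """Average agreement between top-k recommendations and selections.

    selected:    dict semester -> dict student -> set of chosen courses
    recommended: dict semester -> dict student -> ranked list of courses
    """
    semester_terms = []
    for t, by_student in selected.items():
        ratios = []
        for i, chosen in by_student.items():
            top_k = set(recommended[t][i][:k])
            best_possible = min(k, len(chosen))
            missed = best_possible - len(top_k & chosen)
            ratios.append(missed / best_possible)
        semester_terms.append(sum(ratios) / len(ratios))
    return 1.0 - sum(semester_terms) / len(semester_terms)
```

A student who selected two courses of which one appears in the top-2 list contributes a miss ratio of 0.5. Dividing by the minimum of k and the number of selected courses ensures that a student who took fewer than k courses can still reach full conformity.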


We evaluated the conformity score w.r.t. several
combinations of the recommendation parameter values of Eq. (3).
The considered recommendation sizes are 4 and 6 courses,
since for most students this is a realistic balance between
study progression and a manageable workload. Since we are
interested in the relationship between the conformity score
and the distribution of the parameters, we first fix
either $c_p$ or $c_g$ to 1 while the rest stays at zero, which
captures the performance of a single component of our
approach. Moreover, we look for the best combination of both
course dependency graph ($c_p$) and grade prediction ($c_g$).
The third parameter $c_m = 1 - c_p - c_g$ results from the choice
of the first two, which makes the search two-dimensional.
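
The two-dimensional search can be pictured as an enumeration of weight triples that sum to one. The regular grid and its resolution are assumptions of this sketch, as the search procedure is not specified further; the function name `weight_grid` is hypothetical.

```python
def weight_grid(steps):
    """Yield (c_p, c_g, c_m) on a regular grid with c_p + c_g + c_m = 1.

    Only c_p and c_g are enumerated; c_m is whatever remains.
    """
    for a in range(steps + 1):
        for b in range(steps + 1 - a):
            yield a / steps, b / steps, (steps - a - b) / steps
```

The grid contains the two single-component settings (1, 0, 0) and (0, 1, 0) as corner points, so they are covered by the same search.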


Our results in Table 2 show that with increasing $k$ the
conformity grows as more courses are recommended. The first
two columns of the table point out that the course
dependency graph has a higher explanatory value for the
recommendation than the grade prediction. A recommendation
only based on the performance hardly achieves a value
exceeding 50 percent while course dependency alone reaches 70
percent. Therefore it is clear that $c_p$ has to be chosen
significantly larger than $c_g$. This observation is confirmed
by the third column, as in all top-$k$ recommendations we
reached the best conformity with $c_p \approx 0.76$, $c_g \approx 0.21$ and
$c_m \approx 0.03$.


According to these scores our recommendations and the choi- 
ces of the students have an average overlap of about 70 
percent. Hence, there are recommended courses that the 
student did not choose. An example for this case is given 
by the core lecture ‘Embedded Systems’. We recommended 
this course to 89 students, while only 4 of them actually 
took the course in the corresponding semester. As opposed 
to mathematically demanding lectures such as ‘Complexity 
Theory’, which is only recommended for a small set of very 
strong students, this course seems to be a good choice for 
many students but is taken only by few. Moreover, in one 
semester the number of recommendations for basic courses 
was about 200 while only 90 students actually attended the 
courses. This could be related to the fact that many stu- 
dents withdraw from courses after a few weeks when they 
feel that the course is too demanding for them. In this case,
the data does not show their attempt at this course.


6. CONCLUSION 


We proposed an approach that gives personalized course
recommendations for students in order to improve the obtained
grades and to decrease the time-to-degree. We combined a
course dependency graph and performance predictions to
determine a recommendation value for each course. We
assumed that only the top-k courses are given as a
personalized curriculum for a student and tested their conformity
to the actually selected courses of the student. This, however,
does not indicate that our approach significantly improves
the students’ grades or time-to-degree as we expect that
students do not make optimal choices.


Table 2: The conformity score for different valuations of the
recommendation value parameters $(c_p, c_g, c_m)$ and different
top-k recommendation policies.


top-k   (c_p, c_g) = (1, 0)   (c_p, c_g) = (0, 1)   (c_p, c_g)*
4       0.5913                0.3857                0.6349
5       0.6580                0.4564                0.6962
6       0.7138                0.5326                0.7432


An interesting insight from our results is that the course
dependency graph seems better suited for course
recommendation than grade prediction even though it is only based
on aggregated information and does not consider any meta
data. From this result it seems that students tend to focus
more on a course ordering that older students established
rather than selecting according to their own confidence or
skill. Another interesting result is the large overlap (around
70 percent) of recommended and chosen courses. Moreover,
some courses are not taken by students even though our
model indicates that they would lead to an improvement in
performance.


The model itself is flexible in the sense that one can easily 
adjust or extend it by changing the recommendation formula 
and/or incorporate more information to make the grade pre- 
diction more precise. A possible extension is the integration 
of more personalized information given by the student before 
calculating their recommendations. For example, a student
who is more interested in practical lectures could use an
interface to let the system know. Thus, we would be able to
give courses of this category a positive effect on their
recommendation value. The challenge here is to separate the
courses into appropriate categories, since the way a course is 
designed strongly depends on the lecturer and other factors. 


To evaluate the system, it would be interesting to monitor
a sufficiently large number of students during their studies
who choose only recommended courses or at least are
exposed to the course recommendations. An easier evaluation
would be possible with a simulation of hypothetical student 
traces according to our grade prediction approach, where in 
each semester we assume that a student chooses only rec- 
ommended courses. 